Abstract

Background: A proportional hazards measure is suggested in the context of analyzing SROC curves that arise in the 
meta-analysis of diagnostic studies. The measure can be motivated as a special model: the Lehmann model for ROC 
curves. The Lehmann model involves study-specific sensitivities and specificities and a diagnostic accuracy parameter 
which connects the two. 

Methods: A study-specific model is estimated for each study, and the resulting study-specific estimate of diagnostic 
accuracy is taken as an outcome measure for a mixed model with a random study effect and other study-level 
covariates as fixed effects. The variance component model becomes estimable by deriving within-study variances, 
depending on the outcome measure of choice. In contrast to existing approaches - usually of bivariate nature for the 
outcome measures - the suggested approach is univariate and, hence, readily allows the application of conventional
mixed modelling.

Results: Some simple modifications in the SAS procedure proc mixed allow the fitting of mixed models for 
meta-analytic data from diagnostic studies. The methodology is illustrated with several meta-analytic diagnostic data 
sets, including a meta-analysis of the Mini-Mental State Examination as a diagnostic device for dementia and mild 
cognitive impairment. 

Conclusions: The proposed methodology allows us to embed the meta-analysis of diagnostic studies into the 
well-developed area of mixed modelling. Different outcome measures, specifically from the perspective of whether a 
local or a global measure of diagnostic accuracy should be applied, are discussed as well. In particular, variation in 
cut-off value is discussed together with recommendations on choosing the best cut-off value. We also show how this 
problem can be addressed with the proposed methodology. 

Keywords: Diagnostic accuracy, Mixed modelling, Random effects modelling, Cut-off value modelling, 
SROC modelling 



Background 

We are interested in the following setting occurring in the 
field of meta-analysis of diagnostic studies (Hasselblad 
and Hedges [1]; Sutton et al. [2]; Deeks [3]; Schulze 
et al. [4]): a variety of diagnostic studies are available,
providing estimates of the diagnostic measures of specificity
$q = P(T = 0 \mid D = 0)$ as $q_i = x_i/n_i$ and of sensitivity
$p = P(T = 1 \mid D = 1)$ as $p_i = y_i/m_i$, where $D = 1$
and $D = 0$ denote presence or absence of disease,
respectively, and $T = 1$ or $T = 0$ denote positivity
or negativity of the diagnostic test, respectively; $x_i$ is
the number of observed true-negatives out of $n_i$ healthy
individuals, and $y_i$ is the number of observed true-positives
out of $m_i$ diseased individuals, for $i = 1, \ldots, k$,
$k$ being the number of studies. For more details on the
statistical modelling of the diagnostic data from a single 
study, see Pepe [5,6]. For a more detailed introduction 
to meta-analysis of diagnostic studies, see Holling et al. 
[7]. In the following, we will look at several examples - 
mainly from medicine and psychology - for this special 
meta-analytic situation. In principle, however, applica- 
tions could occur in all areas in which meta-analytic data 
is encountered; Swets [8] considers mainly psychological 
applications, but also mentions cases from engineering
(quality control), manufacturing (failing parts in planes),
meteorology (correctness of weather predictions), information
science (correctness of information retrieval), or
criminology (correctness of lie detection tests). We illustrate
the special meta-analytic situation mentioned above
with a meta-analysis on a diagnostic test on heart failure
(see also Holling et al. [7]).

Example 1: Meta-analysis of diagnostic accuracy of
Brain Natriuretic Peptides (BNP) for heart failure. Doust 
et al. [9] provide a meta-analysis on the diagnostic accu- 
racy of the brain natriuretic peptides (BNP) procedure as a 
diagnostic test for heart failure. According to the authors, 
diagnosis of heart failure is difficult, with both overdiag- 
nosis and underdiagnosis occurring. The meta-analysis 
considers a range of diagnostic studies that use different 
reference standards (where a reference standard defines 
the presence or absence of disease). Here we only consider 
the eight studies (see Table 1) using the left ventricular 
ejection fraction of 40% or less as reference standard. 

The cut-off value problem. A separate meta-analysis of 
sensitivity and specificity using the meta-analytic tools 
for independent binomial samples is problematic when 
the underlying diagnostic test utilizes a continuous or 
ordered categorical scale and different cut-off values have 
been used in different diagnostic studies. A simple varia- 
tion of the cut-off value from study to study might lead 
to quite different values of sensitivity and specificity with- 
out any actual change in the diagnostic accuracy of the 
underlying test. 

SROC curve. Due to this comparability problem for 
sensitivity and specificity, interest is usually focussed on 
the summary receiver operating characteristic (SROC) 
curve consisting of the pairs $(1 - q(t), p(t))$ where
$q(t) = P(T < t \mid D = 0)$ and $p(t) = P(T > t \mid D = 1)$ for a
continuous test $T$ with potential value $t$. For a given study $i$,
$i = 1, \ldots, k$, with potentially unknown cut-off value $t_i$,
the pairs $(1 - q(t_i), p(t_i))$ can be estimated by
$(1 - q_i, p_i) = (1 - x_i/n_i, y_i/m_i)$. The SROC curve
accommodates the cut-off value problem. Different pairs could
have quite different values of specificity and sensitivity,
but still reflect identical diagnostic accuracy. The SROC
diagram for the meta-analysis on BNP and heart failure is
given in Figure 1.

Table 1 Meta-analysis of diagnostic accuracy of brain
natriuretic peptides (BNP) for heart failure using the left
ventricular ejection fraction of 40% or less as reference
standard

| Study i | Diseased: y_i (TP) | Diseased: m_i - y_i (FN) | Healthy: x_i (TN) | Healthy: n_i - x_i (FP) | n_i + m_i |
|---|---|---|---|---|---|
| Bettencourt 2000 | 29 | 7 | 46 | 19 | 101 |
| Choy 1994 | 34 | 6 | 22 | 13 | 75 |
| Valli 2001 | 49 | 9 | 78 | 17 | 153 |
| Vasan 2002a | 4 | 6 | 1612 | 85 | 1707 |
| Vasan 2002b | 20 | 40 | 1339 | 71 | 1470 |
| Hutcheon 2002 | 29 | 2 | 102 | 166 | 299 |
| Landray 2000 | 26 | 14 | 75 | 11 | 126 |
| Smith 2000 | 11 | 1 | 93 | 50 | 155 |

Clearly, there is a wide range of values for specificity 
and sensitivity. Nevertheless, as Figure 1 shows, the pos- 
sibility that the pairs might stem from a common SROC 
curve (as given by the dashed curve in Figure 1) cannot 
be discarded. Since the SROC approach accommodates 
the cut-off value problem, it is commonly preferred to 
summary measures like the Youden index [10] or the diag- 
nostic odds ratio [11]. In the following, we focus our 
analysis on the SROC curve. 

Background of SROC modelling. SROC modelling has
received considerable attention in the field and experienced
several developments. An early model was suggested
by Littenberg and Moses [12,13] and has been
used in practice frequently; Deeks [3] discusses its prominent
role in modelling meta-analytic diagnostic study
accuracy. Littenberg and Moses [13] suggest fitting
$D = \alpha + \beta S$, where
$D = \log \mathrm{DOR} = \log\frac{p}{1-p} - \log\frac{1-q}{q}$ is
the log-diagnostic odds ratio and
$S = \log\frac{p}{1-p} + \log\frac{1-q}{q}$
is a measure for a potential threshold effect. After $\alpha$ and
$\beta$ have been estimated from the data, the SROC curve
($p$ vs. $1 - q$) is reconstructed from the estimated values of
$\alpha$ and $\beta$. The parameter $\alpha$ is interpreted as the summary
log-DOR, which is adjusted by means of $S$ for a potential
cut-off value effect.

Figure 1 SROC diagram for BNP and heart failure: circles are the
observed pairs of false positive rate (1 - specificity) and sensitivity,
dashed curve is lowess smoother.

A two-level approach has been suggested by Rutter and
Gatsonis [14], which is typically given in the following
notational form (Walter and Macaskill [15]): let
$Z_{ij} \sim \mathrm{Bi}(n_{ij}, \pi_{ij})$, where $Z_{ij}$ is the number of
test-positives in study $i$ for arm $j$ ($j = 1$ is diseased,
$j = 2$ is non-diseased), $n_{ij}$ is the size of arm $j$ in study $i$,
$\pi_{i1}$ is the sensitivity and $\pi_{i2}$ is the false positive rate;
the model is
$\mathrm{logit}(\pi_{ij}) = (\theta_i + \alpha_i DS_{ij}) \exp(-\beta DS_{ij})$,
where $\theta_i$ is an implicit threshold parameter for study $i$,
$\alpha_i$ is the diagnostic accuracy parameter in study $i$, and
$DS_{ij}$ represents a binary variable for the disease status.
The parameter $\beta$ allows for an association between test
accuracy and test threshold. When $\beta = 0$, $\alpha_i$ is estimated
by $D_i$ and $\theta_i$ is estimated by $S_i/2$, where $D_i$ and $S_i$
are as for the Littenberg-Moses model. Furthermore,
to account for between-study variation, a random effect is
assumed for $\theta_i \sim N(\Theta, \tau_\theta^2)$ and
$\alpha_i \sim N(\Lambda, \tau_\alpha^2)$, with $\theta_i$ and
$\alpha_i$ being independent. As an alternative, a bivariate
normal random-effects meta-analysis has been suggested by
van Houwelingen et al. [16]; see also Reitsma et al. [17]
and Arends et al. [18]. Harbord et al. [19] show that these
models are closely related.

Paper overview. In the following, we propose a spe- 
cific model, called the Lehmann model, which we believe 
is very attractive for the analysis of SROC curves. The 
model involves study-specific sensitivities and specifici- 
ties and a diagnostic accuracy parameter which connects 
the two. The Lehmann model achieves flexibility by allowing
the diagnostic accuracy parameter to become a random
effect. In this it is similar to the Rutter-Gatsonis
model, but differs in that it retains univariate dimensionality
sionality in its outcome measure and, hence, allows a 
mixed model approach in a more conventional way. In 
section "The proportional hazards measure", the propor- 
tional hazards measure is motivated as a specific form 
of SROC curve modelling and is compared to other 
approaches. Section "A mixed model approach" intro- 
duces the specific mixed model in which the log propor- 
tional hazards measure forms the outcome measure, the 
study factor is a normally distributed random effect (to 
cope with unobserved heterogeneity), and other observed 
covariates (such as gold standard or diagnostic test varia- 
tion) are considered as fixed effects in the mixed model. 
Section "Results" considers various applications includ- 
ing a meta-analysis of the Mini-Mental State Examination 
to diagnose dementia or mild cognitive impairment. It 
also provides SAS code for a simple execution of the suggested
approach. In section "Discussion", the choice of
outcome is discussed and the difference between global 
and local diagnostic accuracy measures highlighted. This 
is particularly of interest if observed cut-off value variation
occurs in the meta-analysis and needs to be assessed.
Here a local criterion of diagnostic accuracy appears more
appropriate. The paper ends with some brief conclusions
and discussion in section "Conclusions".

Methods 

The proportional hazards measure 

Numerous summary measures for a pair of specificity
and sensitivity have been suggested: we mention here the
Youden index, $J_i = p_i + q_i - 1$ [10], and the squared
Euclidean distance to the upper left corner in the SROC
diagram, $E_i = (1 - p_i)^2 + (1 - q_i)^2$. [A review of summary
measures is given in Liu [20].] Using an average over any
of these measures might be problematic: not only might 
sensitivities and specificities be heterogeneous, this might 
also be true for the associated summary measures such as 
the Youden index or the Euclidean distance (as demon- 
strated by Figure 2 using the data of the meta-analysis of 
BNP and heart failure). 
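
The variability of these summary measures can be checked directly from the counts in Table 1. The following is a minimal sketch, not code from the paper; the names `BNP_STUDIES`, `summary_measures` and `results` are illustrative assumptions.

```python
# Counts taken from Table 1: (study, TP, FN, TN, FP).
BNP_STUDIES = [
    ("Bettencourt 2000", 29, 7, 46, 19),
    ("Choy 1994", 34, 6, 22, 13),
    ("Valli 2001", 49, 9, 78, 17),
    ("Vasan 2002a", 4, 6, 1612, 85),
    ("Vasan 2002b", 20, 40, 1339, 71),
    ("Hutcheon 2002", 29, 2, 102, 166),
    ("Landray 2000", 26, 14, 75, 11),
    ("Smith 2000", 11, 1, 93, 50),
]


def summary_measures(tp, fn, tn, fp):
    """Return sensitivity p, specificity q, Youden index J and the
    squared Euclidean distance E to the upper left corner (0, 1)."""
    p = tp / (tp + fn)          # sensitivity y_i / m_i
    q = tn / (tn + fp)          # specificity x_i / n_i
    youden = p + q - 1.0
    distance = (1.0 - p) ** 2 + (1.0 - q) ** 2
    return p, q, youden, distance


results = {
    name: summary_measures(tp, fn, tn, fp)
    for name, tp, fn, tn, fp in BNP_STUDIES
}

for name, (p, q, youden, distance) in results.items():
    print(f"{name:17s} p={p:.3f} q={q:.3f} J={youden:.3f} E={distance:.3f}")
```

Printing the values side by side makes the spread across studies visible without relying on the index plots.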

We suggest using the measure $\theta = \log p / \log(1 - q)$, which
relates the log-sensitivity to the log-false positive rate; we 
call it the proportional hazards (PH) measure. In Figure 3 
we see that this measure shows a reduced variability for 
the meta-analysis of BNP and heart failure, making it 
more suitable as an overall measure in the meta-analysis 
of diagnostic studies or diagnostic problems. While the 
measure appears to be like any other summary measure 
of the pair sensitivity and specificity, it has a specific 
SROC-modelling background and motivation. We have 
mentioned previously the cut-off value problem: observed 
heterogeneity might be induced by cut-off value variation 
which could lead to different sensitivities and specifici- 
ties - despite the accuracy of the diagnostic test itself not 
having changed - and might also lead to an induced het- 
erogeneity in the summary measure. Hence, it is unclear 
whether the observed heterogeneity is due to heterogene- 
ity in the diagnostic accuracy (authentic heterogeneity) 
or whether it has occurred due to cut-off value variation 
(artificial heterogeneity). This second form of hetero- 
geneity can also occur when the background population 
changes with the study. 

One of the features of the SROC approach is that it 
incorporates the cut-off value variation in a natural way; 
hence a measure modelling an ROC curve is favorable. We 
suggest the PH measure based upon the Lehmann family in
the following way:

$$p = (1 - q)^{\theta}. \qquad (1)$$

This model was suggested by Le [21] for the ROC curve.
It is an appropriate model since, for feasible $q$, $(1 - q)^{\theta}$
is also feasible as long as $\theta$ is positive. Note that (1) is
defined for all values of $p \in [0, 1]$ and $q \in [0, 1]$ whereas
$\theta = \log p / \log(1 - q)$ is only defined for $p \in (0, 1)$
and $q \in (0, 1)$.

Figure 2 Index plots for sensitivity, specificity, Youden index, and Euclidean distance showing the wide variability of these measures for
the data of the meta-analysis of BNP and heart failure.



Population values of sensitivity and specificity of 1 are
rarely realistic, although observed values of 1 for sensitivity
and specificity do occur in samples. This can be coped
with by using an appropriate smoothing constant such as
estimating specificity as $(n_i - 1)/n_i$ when $x_i = n_i$ and
sensitivity as $(m_i - 1)/m_i$ if $y_i = m_i$.

In Figure 4 we see a number of examples of the proportional
hazards family. It becomes clear now why $\theta$ is called
the proportional hazards measure. By taking logarithms
on both sides of (1) we achieve

$$\theta = \log p(t) / \log[1 - q(t)], \qquad (2)$$

meaning if model (1) holds, the ratio of log-sensitivity
to log-false positive rate is constant across the range of
possible cut-off value choices $t$. Hence the name proportional
hazards model, which was suggested in a paper
by Le [21] and used again in Gonen and Heller [22].
The idea of representing an entire ROC curve in a single
measure is illustrated in Figure 5. While sensitivity and
specificity vary over the entire interval (0, 1), the value of $\theta$
remains constant. Hence, log-sensitivity is proportional to
the log-false positive rate. This assumption is similar to an
assumption used for a model in survival analysis, where it
is assumed that the hazard rate of interest is proportional
to the baseline hazard rate; this might have motivated the
choice of name used by Le [21] and Gonen and Heller [22]
in this context.

Figure 3 Index plots for the PH measures for the data of the meta-analysis of BNP and heart failure.

Charoensawat et al. BMC Medical Research Methodology 2014, 14:56
http://www.biomedcentral.com/1471-2288/14/56

However, it is not our intention to make the assumption
that an entire SROC curve can be represented by
model (1); the explanations above are instead meant as a
motivation that the PH-measure is not just another summary
measure, but can be derived from a ROC modelling
perspective. We envisage that each study, with associated
pair of sensitivity and specificity, can be represented by a
specific PH-model, as illustrated in Figure 6.

Figure 5 Proportional hazards model and associated PH measure.

We see indeed that each pair of sensitivity and specificity
can be associated with its own ROC curve provided
by

$$p = (1 - q)^{\theta_i} \qquad (3)$$

where $\theta_i = \log p_i / \log[1 - q_i]$, so that the curve (3) passes
exactly through the point $(1 - q_i, p_i)$.
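
The claim that curve (3) passes through the observed point can be verified numerically. The sketch below is illustrative only: the three studies are picked arbitrarily from Table 1 (the source does not state which three appear in Figure 6), and the names `ph_measure`, `ph_curve` and `fitted` are assumptions.

```python
import math

# Three studies from Table 1, chosen arbitrarily: (TP, FN, TN, FP).
STUDIES = {
    "Choy 1994": (34, 6, 22, 13),
    "Landray 2000": (26, 14, 75, 11),
    "Smith 2000": (11, 1, 93, 50),
}


def ph_measure(p, q):
    """PH measure theta = log p / log(1 - q), defined for 0 < p, q < 1."""
    if not (0.0 < p < 1.0 and 0.0 < q < 1.0):
        raise ValueError("theta is only defined for p and q strictly in (0, 1)")
    return math.log(p) / math.log(1.0 - q)


def ph_curve(false_positive_rate, theta):
    """Sensitivity on the Lehmann curve p = (1 - q)^theta."""
    return false_positive_rate ** theta


fitted = {}
for name, (tp, fn, tn, fp) in STUDIES.items():
    p = tp / (tp + fn)
    q = tn / (tn + fp)
    theta = ph_measure(p, q)
    fitted[name] = (p, q, theta)
    # The study-specific curve evaluated at 1 - q_i reproduces p_i.
    print(f"{name:13s} theta={theta:.4f} "
          f"curve(1-q)={ph_curve(1.0 - q, theta):.4f} p={p:.4f}")
```

Smaller values of theta correspond to curves closer to the upper left corner, that is, to higher diagnostic accuracy.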

Comparison to other approaches. It remains to be
seen how appropriate the suggested proportional hazards
model is and how it compares to other existing
approaches. We emphasize that in our situation we have
assumed that there is only one pair of sensitivity and false
positive rate $(p_i, 1 - q_i)$ per study $i$. Situations where several
pairs per study are observed (such as in Aertgeerts
et al. [23]) are rare. Hence, on the log-scale for sensitivity
and false-positive rate, we are not able to identify
any straight line model within a study with more than
one parameter, since this would require at least two pairs
of sensitivity and specificity per study; see also Rücker
and Schumacher [24,25]. However, any one-parameter
straight line model, such as the proposed proportional
hazards model, is estimable within each study, although
within-model diagnostics are limited since we are fitting the
full within-study model. Given that sample sizes within
each diagnostic study are typically at least moderately
large, it seems reasonable to assume a bivariate normal
distribution for $\log \hat p$ and $\log(1 - \hat q)$ with means $\log p$ and
$\log(1 - q)$ as well as variances $\sigma_p^2$ and $\sigma_q^2$, respectively,
and covariance $\sigma$ with correlation $\rho = \sigma/(\sigma_p \sigma_q)$. This
is very similar to the assumptions in the approach taken
by Reitsma et al. [17] (see also Harbord et al. [19]), with
the difference that we are using the log-transformation
whereas in Reitsma et al. [17] logit-transformations are
applied. Then, it is a well-known result that the mean
of the random variable $\log \hat p$ (having unconditional mean
$\log p$) conditional upon the value of the random variable
$\log(1 - \hat q)$ (having unconditional mean $\log(1 - q)$) is
provided as

$$E(\log \hat p \mid \log(1 - \hat q)) = \log p + \rho \frac{\sigma_p}{\sigma_q} \left[\log(1 - \hat q) - \log(1 - q)\right], \qquad (4)$$

which can be written as $\alpha + \theta \log(1 - \hat q)$ where
$\alpha = \log p - \theta \log(1 - q)$ and $\theta = \rho \sigma_p / \sigma_q$. This is an important
result since it means that, in the log-space, sensitivity and
false-positive rate are linearly related. Furthermore, if $\alpha$ is
zero, the proportional hazards model arises.

Figure 6 Meta-analysis of BNP and heart failure: each study is represented by its own PH model (1) - illustrated for 3 studies.

The question then arises why not work with a straight
line model

$$\log p = \alpha + \theta \log(1 - q). \qquad (5)$$


The answer is that such a model is not identifiable 
since we have only one pair of sensitivity and speci- 
ficity observed in each study and it is not possible to 
uniquely determine a straight line by just one pair of 
observations since there are infinitely many possible lines 
passing through a given point in the $\log p$ - $\log(1 - q)$
space. However, the proportional hazards model as a 
slope-only model is identifiable and it is more plausible 
than other identifiable models such as the intercept-only 
model. Clearly, a logistic-transformation would be more 
consistent with the existing literature [14,15] than the 
log-transformation. However, both models would give a 
perfect fit (within each study) since there are no degrees 
of freedom left for testing the model fit. The situation 
changes when there are repeated observations of sensi- 
tivity and specificity per study available. However, these 
meta-analyses with repeated observations of sensitivity 
and specificity according to cut-off value variation are 
extremely rare. 

A mixed model approach 

With the motivation of the previous sections in mind,
we assume that $k$ diagnostic studies are available with
diagnostic accuracies $\theta_1, \ldots, \theta_k$ where

$$\theta_i = \frac{\log p_i}{\log(1 - q_i)}. \qquad (6)$$

We assume the following linear mixed model for $\log \theta_i$:

$$\log \theta_i = \beta^T x_i + \delta_i + \epsilon_i \qquad (7)$$

where $x_i$ is a known covariate vector in study $i$, $\delta_i$ is a
normally distributed random effect $\delta_i \sim N(0, \tau^2)$ with $\tau^2$
being an unknown variance parameter, and $\epsilon_i \sim N(0, \sigma_i^2)$
is a normally distributed random error with variance $\sigma_i^2$
known from the $i$-th study.

There are several noteworthy points about the mixed 
model (7). The response is measured on the log-scale, 
where the transformation improves the normal approxi- 
mation and also brings the diagnostic accuracy into a well- 
known link function family: the complementary log-log 
function. The difference of the probability for a positive 
test in the groups with and without the condition is mea- 
sured on the complementary log-log scale. The fixed effect 
part involves a covariate vector x which could contain 
information on study level such as gold standard varia- 
tion, diagnostic test variation, or sample size information. 
It should be noted that there are two variance components,
$\tau^2$ and $\sigma_i^2$. It is important to have information on
the second variance component. If the second component
is unknown, even under the assumption of homogeneity
$\sigma_1^2 = \cdots = \sigma_k^2$, the variance component model would not
be identifiable. Hence, we need to devote some effort to
derive expressions for the within-study variances; this can
be accomplished using the $\delta$-method as discussed in the
next section.

Within study variance. Let us consider (ignoring the
study index $i$ for the sake of simplicity)

$$\log \theta = \log(-\log p) - \log[-\log(1 - q)] \qquad (8)$$

and apply the $\delta$-method. Recall that the variance
$\mathrm{Var}\, T(X)$ of a transformed random variable $T(X)$ can
be approximated as $[T'(E(X))]^2 \mathrm{Var}(X)$ assuming that the
variance $\mathrm{Var}(X)$ of $X$ is known. Applying this $\delta$-method
twice gives

$$\mathrm{Var} \log(-\log p) \approx \frac{p(1 - p)/m}{p^2 (\log p)^2} \qquad (9)$$

and

$$\mathrm{Var} \log(-\log(1 - q)) \approx \frac{q(1 - q)/n}{(1 - q)^2 (\log(1 - q))^2} \qquad (10)$$

so that the within-study variance for the $i$-th study is
provided as

$$\sigma_i^2 = \frac{m_i - y_i}{m_i y_i (\log(y_i/m_i))^2} + \frac{x_i}{n_i (n_i - x_i)(\log(1 - x_i/n_i))^2}. \qquad (11)$$

We acknowledge that the above are estimates of the
variances of the diagnostic accuracy estimates, but are
used as if they were the true variances.
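
As a worked illustration of (8) and (11), the outcome and its within-study variance can be computed from the four cell counts of a study. This is a sketch under the assumption that all four counts are positive (otherwise a smoothing constant is needed first); the function name `log_theta_and_variance` is hypothetical.

```python
import math


def log_theta_and_variance(tp, fn, tn, fp):
    """Return (log theta_i, sigma_i^2) following (8) and (11).

    tp = y_i, fn = m_i - y_i, tn = x_i, fp = n_i - x_i.
    Assumes all four counts are strictly positive.
    """
    m = tp + fn                      # number of diseased individuals
    n = tn + fp                      # number of healthy individuals
    log_p = math.log(tp / m)         # log-sensitivity (negative)
    log_fpr = math.log(fp / n)       # log-false positive rate (negative)
    log_theta = math.log(-log_p) - math.log(-log_fpr)
    # First term of (11): (m_i - y_i) / (m_i y_i (log(y_i/m_i))^2)
    var_sens = fn / (m * tp * log_p ** 2)
    # Second term of (11): x_i / (n_i (n_i - x_i) (log(1 - x_i/n_i))^2)
    var_fpr = tn / (n * fp * log_fpr ** 2)
    return log_theta, var_sens + var_fpr


# Example: the Choy 1994 row of Table 1.
log_theta, sigma2 = log_theta_and_variance(tp=34, fn=6, tn=22, fp=13)
print(f"log(theta) = {log_theta:.4f}, sigma^2 = {sigma2:.4f}, "
      f"weight = {1.0 / sigma2:.2f}")
```

The inverse of the returned variance is the weight used later when fitting the mixed model.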

Some important cases. If there are no further covariates,
two important models are easily identified as special cases
of (7). One is the fixed effects model

$$\log \theta_i = \beta_0 + \epsilon_i \qquad (12)$$

and the other is the random effects model

$$\log \theta_i = \beta_0 + \delta_i + \epsilon_i \qquad (13)$$

which have gained some popularity in the meta-analytic
literature.
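
To make the two special cases concrete, the sketch below pools the log PH measures of the BNP studies from Table 1. It is an illustration only: the paper fits these models by maximum likelihood in SAS, whereas this sketch swaps in the simpler DerSimonian-Laird moment estimator for the between-study variance. All function and variable names are assumptions.

```python
import math

# Table 1 counts: (TP, FN, TN, FP).
BNP = [(29, 7, 46, 19), (34, 6, 22, 13), (49, 9, 78, 17), (4, 6, 1612, 85),
       (20, 40, 1339, 71), (29, 2, 102, 166), (26, 14, 75, 11), (11, 1, 93, 50)]


def log_theta_and_variance(tp, fn, tn, fp):
    """Outcome (8) and within-study variance (11) for one study."""
    m, n = tp + fn, tn + fp
    log_p, log_fpr = math.log(tp / m), math.log(fp / n)
    y = math.log(-log_p) - math.log(-log_fpr)
    v = fn / (m * tp * log_p ** 2) + tn / (n * fp * log_fpr ** 2)
    return y, v


def pooled_estimate(y, v, tau2=0.0):
    """Inverse-variance weighted mean and its standard error.

    tau2 = 0 gives the fixed effects model (12); tau2 > 0 gives the
    random effects model (13) with weights 1 / (sigma_i^2 + tau^2).
    """
    w = [1.0 / (vi + tau2) for vi in v]
    estimate = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return estimate, math.sqrt(1.0 / sum(w))


def dersimonian_laird_tau2(y, v):
    """Moment estimator of tau^2 (a stand-in for the ML estimate)."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q_stat = sum(wi * (yi - mean) ** 2 for wi, yi in zip(w, y))
    c = sw - sum(wi ** 2 for wi in w) / sw
    return max(0.0, (q_stat - (len(y) - 1)) / c)


pairs = [log_theta_and_variance(*row) for row in BNP]
y = [a for a, _ in pairs]
v = [b for _, b in pairs]

tau2 = dersimonian_laird_tau2(y, v)
fixed, se_fixed = pooled_estimate(y, v)
random_, se_random = pooled_estimate(y, v, tau2)
print(f"fixed:  beta0 = {fixed:.4f} (SE {se_fixed:.4f})")
print(f"random: beta0 = {random_:.4f} (SE {se_random:.4f}), tau^2 = {tau2:.4f}")
```

Exponentiating the pooled intercept gives a summary value of theta and hence a summary curve of form (1).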

Results 

Case study on MMSE and dementia 

We illustrate the approach with an example and revisit 
a meta-analysis by Mitchell [26] on the diagnostic accu- 
racy of the mini-mental state examination (MMSE) as a 
diagnostic test for the detection of dementia and, more 
recently, mild cognitive impairment (MCI). In this meta- 
analysis 38 studies were included and the entire data are 
reproduced in Table 2. We are interested in the question: is
there a difference in diagnostic accuracy of the MMSE in
the detection of dementia and MCI, as Figure 7 suggests?

We use proc mixed from the SAS software, version
9.2 for Windows [27], for the analysis (see also Table 3).

Table 2 Meta-analysis of the diagnostic accuracy of the
mini-mental state examination (MMSE) and dementia or
mild cognitive impairment (MCI) as reference standard;
TP = true positives, FN = false negatives, FP = false
positives, TN = true negatives

| Study | Condition | TP | FN | FP | TN |
|---|---|---|---|---|---|
| 1 | Dementia | 65 | 3 | 240 | 870 |
| 2 | Dementia | 117 | 12 | 10 | 110 |
| 3 | Dementia | 48 | 19 | 63 | 989 |
| 4 | Dementia | 134 | 8 | 28 | 152 |
| 5 | Dementia | 24 | 5 | 44 | 292 |
| 6 | Dementia | 67 | 15 | 48 | 153 |
| 7 | Dementia | 64 | 17 | 1 | 71 |
| 8 | Dementia | 281 | 64 | 20 | 286 |
| 9 | Dementia | 13 | 1 | 44 | 286 |
| 10 | Dementia | 262 | 20 | 29 | 177 |
| 11 | Dementia | 143 | 18 | 29 | 123 |
| 12 | Dementia | 183 | 33 | 33 | 51 |
| 13 | Dementia | 22 | 1 | 152 | 140 |
| 14 | Dementia | 112 | 1 | 590 | 2091 |
| 15 | Dementia | 152 | 81 | 126 | 1009 |
| 16 | Dementia | 29 | 26 | 26 | 236 |
| 17 | Dementia | 31 | 6 | 3 | 247 |
| 18 | Dementia | 10 | 3 | 12 | 333 |
| 19 | Dementia | 707 | 88 | 1438 | 10447 |
| 20 | Dementia | 181 | 108 | 17 | 184 |
| 21 | Dementia | 59 | 29 | 23 | 74 |
| 22 | Dementia | 74 | 23 | 16 | 143 |
| 23 | Dementia | 27 | 12 | 26 | 209 |
| 24 | Dementia | 40 | 6 | 75 | 528 |
| 25 | Dementia | 317 | 52 | 173 | 578 |
| 26 | Dementia | 387 | 116 | 16 | 54 |
| 27 | Dementia | … | … | … | 44 |
| 28 | Dementia | 44 | 7 | 34 | 396 |
| 29 | Dementia | 123 | 46 | 98 | 309 |
| 30 | Dementia | 25 | 43 | 3 | 171 |
| 31 | Dementia | 73 | 32 | 2 | 225 |
| 32 | Dementia | 37 | 45 | 1 | 440 |
| 33 | Dementia | 78 | 34 | 45 | 376 |
| 34 | MCI | 72 | 12 | 53 | 214 |
| 35 | MCI | 106 | 23 | 410 | 379 |
| 36 | MCI | 37 | 36 | 22 | 118 |
| 37 | MCI | 67 | 30 | 22 | 75 |
| 38 | MCI | 17 | 77 | 1 | 90 |


The values of the dependent variable $\log \theta_i$ are easily
constructed from Table 2. We are interested to see if there
are differences in accuracy for diagnosing MCI compared
to diagnosing dementia. Hence we have constructed a
covariate condition which takes the value 1 if the
study concerns MCI as condition and 0 if the study is on
dementia. Since we have fixed within-study variances, we
need to tell proc mixed to incorporate this appropriately;
this can be accomplished by using a weight,
$w_i = 1/\sigma_i^2$. The random option induces a random effect (here
study) with associated variance component $\tau^2$, which
is estimated. However, SAS proc mixed will automatically
fit a within-study variance component (on top of
the provided variances). To circumvent this mechanism,
the option parms (1) (1) / hold=2 is used where
the term hold=2 fixes the second variance component,
corresponding to the within-study variance multiplier, to
one. Note that the random effect modelling between-study
variation is described by a free variance parameter,
$\tau^2$. For this a starting value needs to be given: we have used
$\tau^2 = 1$, although other choices are possible, e.g. $\tau^2 = 0$,
corresponding to the case of no heterogeneity between
studies.

Figure 7 SROC diagram for the meta-analysis of MMSE and
dementia or MCI as reference standard.
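
The effect of fixing the within-study variances and estimating only $\tau^2$ can be mimicked outside SAS. The sketch below maximizes the profile likelihood of model (7) with one binary covariate over a grid of $\tau^2$ values. It is not the authors' code and not a substitute for proc mixed: it uses only studies 28 to 38 of Table 2, so its estimates are not expected to match Table 4, and the grid bounds and all names are assumptions.

```python
import math

# (condition, TP, FN, FP, TN) for studies 28-38 of Table 2;
# condition = 1 for MCI and 0 for dementia.
SUBSET = [
    (0, 44, 7, 34, 396), (0, 123, 46, 98, 309), (0, 25, 43, 3, 171),
    (0, 73, 32, 2, 225), (0, 37, 45, 1, 440), (0, 78, 34, 45, 376),
    (1, 72, 12, 53, 214), (1, 106, 23, 410, 379), (1, 37, 36, 22, 118),
    (1, 67, 30, 22, 75), (1, 17, 77, 1, 90),
]


def outcome(tp, fn, fp, tn):
    """log theta_i from (8) and its delta-method variance from (11)."""
    m, n = tp + fn, fp + tn
    log_p, log_fpr = math.log(tp / m), math.log(fp / n)
    y_i = math.log(-log_p) - math.log(-log_fpr)
    v_i = fn / (m * tp * log_p ** 2) + tn / (n * fp * log_fpr ** 2)
    return y_i, v_i


x = [float(cond) for cond, *_ in SUBSET]
pairs = [outcome(*counts) for _, *counts in SUBSET]
y = [a for a, _ in pairs]
v = [b for _, b in pairs]


def weighted_fit(tau2):
    """Weighted least squares for intercept and slope given tau^2."""
    w = [1.0 / (vi + tau2) for vi in v]
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det


def profile_loglik(tau2):
    """Normal log-likelihood with Var(y_i) = sigma_i^2 + tau^2."""
    b0, b1 = weighted_fit(tau2)
    return -0.5 * sum(
        math.log(2.0 * math.pi * (vi + tau2))
        + (yi - b0 - b1 * xi) ** 2 / (vi + tau2)
        for xi, yi, vi in zip(x, y, v)
    )


# Grid search over tau^2 in [0, 3] with step 0.001 (assumed bounds).
GRID = [i / 1000.0 for i in range(0, 3001)]
tau2_hat = max(GRID, key=profile_loglik)
beta0_hat, beta1_hat = weighted_fit(tau2_hat)
print(f"tau^2 = {tau2_hat:.3f}, intercept = {beta0_hat:.4f}, "
      f"condition = {beta1_hat:.4f}")
```

A grid search is used here for transparency; a dedicated optimizer would be the natural choice in practice.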

The results of the analysis are provided in Table 4. 
It can be seen that there is a significant effect of con- 
dition (dementia/MCI) on the diagnostic accuracy, with 
diagnostic accuracy being significantly higher in stud- 
ies with patients having dementia in comparison to the 
diagnostic accuracy in studies with patients having mild 
cognitive impairment. Nevertheless, not all heterogene- 
ity is explained by this covariate as the random effect 
(study effect) still remains significant, as the bottom part 
of Table 4 shows. 

The inference is based here on a procedure called the
Wald test. The estimated parameter value is divided by its
estimated standard error, and the result is given in column
four in Table 4. The likelihood ratio test may be considered
as an alternative. It is defined as two times the difference
of the log-likelihood including the effect of interest
and the log-likelihood not including the effect of interest.
For the effect of condition in Table 4, we find a value of
6.8 for the likelihood-ratio test. The Wald test is asymptotically
standard normal under the null-hypothesis of
absence of effect, whereas the likelihood ratio test statistic
is asymptotically chi-squared distributed with degrees
of freedom equal to the number of parameters associated
with the effect considered (in this case one). It is
well-known that the likelihood ratio test is more powerful.
Here, both tests provide similar p-values, with 0.0091 for
the likelihood ratio test and 0.0069 for the Wald test; this
confirms the significance of the effect (dementia/MCI) on
the diagnostic accuracy.

It is trivial to construct the associated SROC curves
from Table 4. We find

for dementia: $p = (1 - q)^{\exp(-2.2878)}$

for MCI: $p = (1 - q)^{\exp(-2.2878 + 0.8605)}$

Note that the likelihood ratio test as well as the Wald test
need modification in situations where the null hypothesis
is part of the boundary of the alternative such as when
testing $H_0: \tau^2 = 0$. In this case, the asymptotic null
distribution of the likelihood ratio test statistic is no longer $\chi^2$
with 1 df but rather a mixture giving equal weights 0.5 to
the one-point mass distribution at 0 and a $\chi^2$ distribution
with 1 df [28]. Practically, this means
that standard 2-sided p-values have to be divided by 2.

Table 3 SAS proc mixed adapted for meta-analysis of diagnostic accuracy study data

| SAS statement | Explanation |
|---|---|
| proc mixed data=MMSE method=ml covtest; | procedure mixed of SAS, data contains the data file, method specifies estimation |
| class study condition; | defines the categorical variables used |
| model logtheta = condition / s; | defines the model: LHS outcome, RHS covariates used |
| weight w; | w contains inverse variance as weight |
| random study(condition); | factor study nested in condition |
| parms (1) (1) / hold=2; | specifies starting values, hold=2 fixes the residual variance component |
| run; | executes the program |

Table 4 Analysis of effects for the meta-analysis of the
diagnostic accuracy of the mini-mental state examination
(MMSE) and dementia or mild cognitive impairment (MCI)
as reference standard

| Effect | Parameter estimate | SE | Z-value |
|---|---|---|---|
| fixed | | | |
| Intercept | -2.2878 | 0.1208 | -18.94 |
| condition | 0.8605 | 0.3187 | 2.70 |
| random | | | |
| $\tau^2$ (study) | 0.3078 | 0.1049 | 2.90 |
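
The p-values and the two SROC curves can be reproduced from the reported numbers with a few lines of code. This is a sketch using only the Python standard library; the function names are illustrative assumptions, and the inputs are the Z-value and likelihood ratio statistic reported in the text and Table 4.

```python
import math


def wald_p_value(z):
    """Two-sided p-value for a standard normal Wald statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def lrt_p_value_1df(statistic):
    """Upper tail of the chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


def boundary_lrt_p_value(statistic):
    """p-value under the 50:50 mixture of a point mass at 0 and a
    chi-squared distribution with 1 df (for statistic > 0)."""
    return 0.5 * lrt_p_value_1df(statistic)


# Fixed effect estimates from Table 4.
INTERCEPT, CONDITION = -2.2878, 0.8605
theta_dementia = math.exp(INTERCEPT)
theta_mci = math.exp(INTERCEPT + CONDITION)


def sroc(false_positive_rate, theta):
    """Sensitivity on the fitted curve p = (1 - q)^theta."""
    return false_positive_rate ** theta


print(f"Wald p-value (Z = 2.70):  {wald_p_value(2.70):.4f}")
print(f"LRT p-value (stat = 6.8): {lrt_p_value_1df(6.8):.4f}")
for fpr in (0.1, 0.2, 0.3):
    print(f"1 - q = {fpr:.1f}: dementia p = {sroc(fpr, theta_dementia):.3f}, "
          f"MCI p = {sroc(fpr, theta_mci):.3f}")
```

At any fixed false positive rate the dementia curve lies above the MCI curve, because its exponent is smaller.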

Case study on MOOD and depressive disorders 

The MOOD module of the Patient Health Questionnaire 
(PHQ-9) has been developed to screen and to diagnose 
patients in primary care with depressive disorders. The 
instrument consists of 9 questions, each scored from 0 
to 3 points with a total score ranging from 0 to 27. In 
a meta-analysis of the diagnostic accuracy of MOOD, 
Wittkampf et al. [29] included 12 studies. These studies
used either a cut-off of 10 (referred to here as "sum- 
mary score") or a more complex evaluation algorithm 
("algorithm"). The complete data are listed in Table 5 and 
the associated SROC diagram is given in Figure 8. The 
impression from the graph is that the cut-off of 10 used by 

Table 5 Meta-analysis of the diagnostic accuracy of the
MOOD module and depression in patients in primary care
as reference standard; TP = true positives, FN = false
negatives, FP = false positives, TN = true negatives

| Study | Cut-off | TP | FN | FP | TN |
|---|---|---|---|---|---|
| 1 | algorithm | 65 | 26 | 104 | 1192 |
| 2 | algorithm | 70 | 13 | 74 | 846 |
| 3 | sum score | 62 | 10 | 27 | 429 |
| 4 | sum score | 36 | 5 | 65 | 474 |
| 5 | sum score | 55 | 11 | 43 | 392 |
| 6 | algorithm | 6 | 8 | 12 | 144 |
| 7 | sum score | 121 | 103 | 80 | 720 |
| 8 | algorithm | 11 | 5 | 5 | 76 |
| 9 | algorithm | 6 | 5 | 0 | 3 |
| 10 | algorithm | 85 | 31 | 9 | 460 |
| 11 | sum score | 15 | 1 | 4 | 42 |
| 12 | sum score | 96 | 10 | 23 | 187 |

Figure 8 SROC diagram for the meta-analysis of MOOD and
depression in patients in primary care.



Discussion 

Global versus local criteria 

We have focussed on the PH measure so far, as it pro- 
vides an appropriate measure for comparing SROC curves 
globally, in the sense that cut-off value variation will not 
necessarily effect the estimate of the SROC curve. The 
situation is illustrated in Figure 9. 

Evidently, different cut-off values are associated with the 
same value of logf9, hence, the PH measure \o%6 is not 
the best measure to discriminate different cut-off values. 
This is not surprising, since the SROC curve is a concept 
designed for assessing the diagnostic accuracy of a diag- 
nostic test globally, in the sense that it adjusts for different 
cut-off values. Hence, a measure that assesses local per- 
formance of the diagnostic is needed. Assuming that every 
cut-off value used in the meta-analysis is clinically mean- 
ingful, we suggest use of the (squared) Euclidean distance 
to the upper left corner (0, 1) of the ROC diagram as a 
more meaningful measure to compare cut-off values: 



E i = {i-p i y + (i-q i y 



(14) 



the summary score has a higher diagnostic accuracy than 
the alternative. 

The presence or absence of a cut-off value effect is now 
more formally investigated using a covariate cut-off, 
which is zero when the summary score with a cut-off value 
of 10 is used and one otherwise. The results are presented 
in Table 6. It can be seen that the covariate cut-off 
level "summary score" is associated with a higher diag- 
nostic accuracy, although, as seen from the Wald statistics 
provided in column four of Table 6, the effect is not signif- 
icant. We see a significant random effect (study; adjusted 
p-value 0.0274; see comment at the end of section 
"Case study on MMSE and dementia"), which indicates 
that the random study effect is needed in the analysis. It 
is not really surprising that the covariate cut-off is not 
significant, since the concept of the SROC is designed to 
accommodate the cut-off value variation. We will take up 
this point in the next section. 



Table 6 Analysis of the cut-off effect for the meta-analysis 
of the MOOD module and depression in patients in 
primary care 



Effect 


Parameter 
estimate 


SE 


Z-value 


fixed 








Intercept 


-2.5332 


0.2817 


-8.99 


cut-off 


0.4804 


0.3966 


1.21 


random 
r 2 (study) 


0.3239 


0.1690 


1.92 



where $p_i = y_i/m_i$ and $q_i = x_i/n_i$. Each point in the
SROC diagram has a unique circle with center (0, 1) that
passes through this point. In Figure 9, one such circle is
illustrated; the point on it has the smallest Euclidean distance
among the three available points (its circle has the smallest
radius among the circles associated with the three points).
In the following, we will consider the criterion
(14) as an alternative criterion to choose the cut-off value.

Figure 9 Different cut-off values with associated sensitivities
and specificities on the same SROC curve with different
Euclidean distances; the point on the circle has the shortest
Euclidean distance to the upper left vertex of the SROC diagram.
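The situation described for Figure 9 can be mimicked numerically. The following Python sketch is an illustration under assumed values: the accuracy parameter `theta` and the three false positive rates are hypothetical and are not taken from any of the data sets. It places three points on the same curve $p = (1 - q)^{\theta}$, confirms that they share the same $\log\theta$, and compares their squared Euclidean distances to (0, 1).

```python
import math


def lehmann_sensitivity(fpr, theta):
    """Sensitivity on the curve p = (1 - q)^theta, with fpr = 1 - q."""
    return fpr ** theta


def squared_distance(p, q):
    """Squared Euclidean distance from the point (1 - q, p) to the corner (0, 1)."""
    return (1.0 - p) ** 2 + (1.0 - q) ** 2


theta = 0.2                 # hypothetical accuracy parameter
fprs = [0.05, 0.20, 0.50]   # hypothetical false positive rates, one per cut-off

points = []
for fpr in fprs:
    p = lehmann_sensitivity(fpr, theta)
    q = 1.0 - fpr
    # Global measure: identical for all points on the same curve.
    log_theta = math.log(math.log(p) / math.log(fpr))
    # Local measure: differs from point to point.
    points.append((fpr, p, log_theta, squared_distance(p, q)))

best = min(points, key=lambda t: t[3])

for fpr, p, log_theta, e in points:
    print(f"1-q={fpr:.2f}  p={p:.3f}  log(theta)={log_theta:.3f}  E={e:.4f}")
print("smallest E at 1-q =", best[0])
```

In this assumed configuration the middle point has the smallest distance, although all three points are indistinguishable in terms of $\log\theta$.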
Since we have changed the criterion, we need to determine
the associated within-study variances. This can be
accomplished easily, using the $\delta$-method once more, to
obtain

$$\mathrm{Var}(E) \approx 4(1 - p)^2 \, \frac{p(1 - p)}{m} + 4(1 - q)^2 \, \frac{q(1 - q)}{n}, \qquad (15)$$

where we have ignored study indexes for the sake
of simplicity. Using this criterion, we see in Figure 9
that cut-off values can vary considerably in their local
diagnostic accuracy, despite having identical diagnostic
accuracy at a global level. We re-analyze the
meta-analysis of MOOD and depression with respect to the
(squared) Euclidean distance and provide the results in
Table 7.
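As a minimal sketch of how the local criterion and its $\delta$-method variance could be computed from a single 2 x 2 table, consider the following Python function. The function name is hypothetical; the formula is the variance approximation stated above, with $m$ the number of diseased and $n$ the number of non-diseased participants. The example uses the counts of study 3 from Table 5.

```python
def euclid_with_variance(tp, fn, fp, tn):
    """Return (E, Var(E)) for one study.

    E is the squared Euclidean distance to the corner (0, 1);
    Var(E) is its delta-method approximation.
    """
    m = tp + fn          # number of diseased participants
    n = fp + tn          # number of non-diseased participants
    p = tp / m           # sensitivity
    q = tn / n           # specificity
    e = (1.0 - p) ** 2 + (1.0 - q) ** 2
    var = (4.0 * (1.0 - p) ** 2 * p * (1.0 - p) / m
           + 4.0 * (1.0 - q) ** 2 * q * (1.0 - q) / n)
    return e, var


# Study 3 of Table 5: TP = 62, FN = 10, FP = 27, TN = 429
e, var = euclid_with_variance(62, 10, 27, 429)
weight = 1.0 / var       # inverse-variance weight for the mixed model
print(round(e, 4), var, weight)
```

Note that a term of the variance vanishes when the corresponding proportion equals 1 (as for the specificity of study 9 in Table 5); how such studies should be handled is not addressed by this sketch.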

Evidently, both criteria lead to the same conclusion,
namely that using the summary score with a cut-off value
of 10 leads to the higher diagnostic accuracy (although
the effect is not significant). It might also be worthwhile
looking at the results of the likelihood ratio test: for the
PH measure as the outcome variable, the likelihood ratio
test provides a value of 1.5; for the Euclidean distance, the
value of the likelihood ratio test is 1.7, confirming the
non-significance of the effect. Nevertheless, the point estimates
in both analyses favour the cut-off value of 10.

Table 7 Analysis of the cut-off effect for the meta-analysis
of the MOOD module and depression in patients in
primary care

Criterion            Effect    Parameter estimate   SE       Z-value
PH measure           cut-off   0.4804               0.3966   1.21
Euclidean distance   cut-off   0.0563               0.0430   1.31

Meta-analysis of magnetic resonance spectroscopy and
prostate cancer

This case study provides an example where the use
of a global or local criterion leads to a different
conclusion. Magnetic resonance spectroscopy has the
ability to discriminate between prostate cancer and
benign prostatic hyperplasia, based on reduced citrate
and elevated choline in the cancerous region. The
diagnostic test works on a voxel of signal intensity ratios
of (choline + creatine)/citrate. Two cut-off points are in
use: < 0.75 and < 0.86. The results collected in a
meta-analysis by Wang et al. [30] include 12 studies, as
presented in Table 8; the associated SROC diagram is
presented in Figure 10. From the graph, there is no obvious
choice for the "best" cut-off value.

Table 8 Meta-analysis of the magnetic resonance
spectroscopy and prostate cancer; TP = true positives,
FN = false negatives, FP = false positives, TN = true
negatives

Study   Cut-off    TP    FN    FP    TN
1       0.75      122    30    35    55
2       0.75       73     8    80   219
3       0.75       75     6    92   207
4       0.75      123    39    38    50
5       0.75      134    21    40    39
6       0.75       12    12     7    75
7       0.86       81    71    24    59
8       0.86       56    25    32   267
9       0.86       52    29    20    59
10      0.86       98    57    20    59
11      0.86        6     9    15   266
12      0.86       44     8    32   264

The fixed effects parts of the mixed model analysis,
using the global PH measure and the local Euclidean
measure as criteria, are presented in Table 9. It is interesting
to note that the focus of the analysis, global or local,
is an important part of the analysis. Globally, the better
diagnostic accuracy is given by the cut-off value of 0.75,
whereas better local performance is achieved with a
cut-off value of 0.86, although neither effect is significant.
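The Z-values reported for the cut-off effect in Table 9 can be checked as the ratio of estimate to standard error. The Python sketch below does this and adds two-sided p-values from the standard normal distribution; the p-values are our own illustration and are not reported in Table 9, and the function name is hypothetical.

```python
import math


def wald(estimate, se):
    """Return the Wald Z-value and its two-sided normal p-value."""
    z = estimate / se
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_two_sided


# Estimates and standard errors of the cut-off effect, taken from Table 9
table9 = {
    "PH measure": (0.2049, 0.3516),
    "Euclidean distance": (-0.0212, 0.0573),
}

results = {name: wald(est, se) for name, (est, se) in table9.items()}

for name, (z, p) in results.items():
    print(f"{name}: Z = {z:.2f}, two-sided p = {p:.2f}")
```

Both p-values are far above conventional significance levels, in line with the statement that neither effect is significant.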

Figure 10 SROC diagram for the meta-analysis of the magnetic
resonance spectroscopy and prostate cancer.

Table 9 Analysis of the cut-off effect for the meta-analysis
of the magnetic resonance spectroscopy and prostate
cancer

Criterion            Effect (reference)   Parameter estimate   SE       Z-value
PH measure           cut-off (< 0.75)      0.2049              0.3516    0.58
Euclidean distance   cut-off (< 0.75)     -0.0212              0.0573   -0.37

PH measure and positive likelihood ratio

Another frequently used diagnostic measure is the
positive likelihood ratio, defined as the ratio of sensitivity to
false positive rate, $P(T = 1 \mid D = 1)/P(T = 1 \mid D = 0)$
or $p/(1 - q)$. It differs from the PH measure in that, for the
latter, the ratio is taken on the log-scale: $\theta = \log p/\log(1 - q)$.
Furthermore, if re-expressed as models, the positive
likelihood ratio corresponds to $p = \theta'(1 - q)$, a straight
line with no intercept, whereas the PH measure
corresponds to $p = (1 - q)^{\theta}$, a straight line on the log-scale
with no intercept. The positive likelihood ratio is a
natural measure since it transfers the concept of relative risk
(the risk of a positive test in the diseased group relative to the
risk of a positive test in the non-diseased group) to the
diagnostic setting. However, it is less suitable as an (S)ROC model
since it does not provide a function which connects the
lower left vertex with the upper right vertex in the ROC
diagram (which, in contrast, the PH model does provide).
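The contrast between the two models can be made concrete with a small Python sketch. The sensitivity and specificity used here are hypothetical values chosen for illustration, and the function names are our own. The sketch evaluates both implied curves at the two vertices of the ROC diagram.

```python
import math


def ph_measure(p, q):
    """PH measure theta = log(p) / log(1 - q)."""
    return math.log(p) / math.log(1.0 - q)


def positive_lr(p, q):
    """Positive likelihood ratio p / (1 - q)."""
    return p / (1.0 - q)


def ph_curve(fpr, theta):
    """Sensitivity implied by the PH model p = (1 - q)^theta."""
    return fpr ** theta


def lr_line(fpr, lr):
    """Sensitivity implied by the likelihood ratio model p = lr * (1 - q)."""
    return lr * fpr


p, q = 0.80, 0.90            # hypothetical sensitivity and specificity
theta = ph_measure(p, q)
lr = positive_lr(p, q)

# Both models pass through the observed point (1 - q, p) and through (0, 0).
print(ph_curve(1.0 - q, theta), lr_line(1.0 - q, lr))
print(ph_curve(0.0, theta), lr_line(0.0, lr))
# At a false positive rate of 1 only the PH curve reaches sensitivity 1.
print(ph_curve(1.0, theta), lr_line(1.0, lr))
```

With these assumed values the likelihood ratio line leaves the unit square well before the false positive rate reaches 1, which is why it cannot serve as an ROC curve over the whole range.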

Conclusions 

The approach presented here is attractive since it is based 
on a simple measure of diagnostic accuracy per study, 
namely the ratio of log-sensitivity to log-false-positive 
rate. It also embeds the diagnostic meta-analysis problem 
into the well-known and much used mixed model setting. 
In particular, the analysis of effects of observed covariates 
on the diagnostic accuracy can easily be incorporated. 

Controversies in the meta-analysis of diagnostic stud- 
ies usually focus on comparability of studies. Study types 
might be case-control, cohort, cross-sectional or other. 
Studies might differ in the gold standard, severity of dis- 
ease, or in the application of the diagnostic test. Patient 
populations might differ across studies, as might the cut- 
off value (defining positivity of the diagnostic test). All 
these different aspects, if observed, can be easily incorpo- 
rated and analyzed as fixed effects in the special mixed 
model suggested here. 

The occurrence of heterogeneity in the meta-analysis of 
diagnostic studies is more the rule than the exception; it is 
thus important to quantify the heterogeneity across stud- 
ies due to the different sources. The approach provided 
here offers a more detailed investigation of heterogeneity 
according to the various observed sources and a residual
heterogeneity (measured by $\tau^2$). This might allow us
to construct a measure of relative residual heterogeneity, 
which might help to assess how trustworthy the results of 
a given meta-analysis may be. This will be investigated in 
future research. 

In a recent study on the meta-analytical evaluation
of coronary CT angiography studies, Schuetz et al. [31]
investigated the problem of non-evaluable results that
occur in the individual studies. They conclude that
diagnostic accuracy measures change considerably depending
on how non-evaluable results are treated. In fact, they
conclude that

parameters for diagnostic performance significantly
decrease if non-evaluable results are included by a 3 × 2
table for analysis (intention to diagnose approach).

Twenty-six studies were included in their meta-analysis 
with a wide range of non-evaluable results from 0 to 43. 
Using the approach suggested here, it would be very easy 
to analyze the effect of non-evaluable results on the diag- 
nostic accuracy by including the amount of non-evaluable 
results per study as a fixed effect in the proposed mixed 
model. 